Choose a data from UN data or OECD data.
Create an R Notebook of a data analysis containing the following
and submit the rendered file (e.g., w5_g123456.nb.html) in
Moodle.
Create an R Notebook using the R Notebook Template in Moodle and
save it as, for example, w5_g123456.Rmd.
Edit author with name, ID, and title.
Contents should include the following:
A short abstract
Information of data including data name, description and link to reach the data
Several charts: a bar graph or a column graph, a histogram, a line graph, a scatter plot
Observations or questions for visualizations
Use left_join, if possible. (a challenge, not required)
Run each code block and preview to create, for
example, w5_g123456.nb.html.
Or, Run All under Run and Preview to check in your web browser!
You can also view your file
(e.g. w5_g123456.nb.html) by opening it from your web
browser.
Submit your R Notebook file (w5_g123456.nb.html) in Moodle (Week Five Assignment).
Due Sunday 28, January 2024, 11:59 PM
The title can be in the title in YAML.
We study …..
Install a package countrycode first.
library(tidyverse)
library(WDI)
library(readxl)
library(countrycode)
Do not forget to add observations and questions.
Let me illustrate steps using the following indicators. Issues are not taken from the analysis of the one who chose these indicators.
UN Data - Government expenditure on education as % of GDP: Link,
UN Data - Gross Enrollment Ratio - Tertiary Education: Link
When download the data, choose country codes from ‘Select columns’. The first data adds iso3c codes and the second adds UN number codes.
It is helpful to add links to the site.
library(tidyverse)
Create data folder if you do not have it under Files.
dir.create("data")
If you do not have wdicache.rds in your data folder,
run the following two code chunks.
wdicache <- WDIcache()
wdi_country_extra <- wdicache$country |> select(iso3c, region, income, lending)
Go to the data link. Add Reference Area Code from Select column on top. Click Update. You will see that iso3c is added in the first column. Then download a CSV file (Comma). You get ‘UNdata_Export_20240131_024758033.csv’.
Next go the the data link. Similarly, add Country or Area Code from Select column on top. Click Update. You will see that UN code is added in the first column. Then download a CSV file (Comma). You get ‘data/UNdata_Export_20240131_024820801.csv’.
Place the downloaded data ‘UNdata_Export_20240131_024758033.csv’ and ‘UNdata_Export_20240131_024820801.csv’ in the data folder.
df_ed_gov_exp <- read_csv("data/UNdata_Export_20240131_024758033.csv")
Rows: 3195 Columns: 7── Column specification ─────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Reference Area Code, Reference Area, Sex, Age group, Units of measurement
dbl (2): Time Period, Observation Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_ed_gov_exp
df_ed_gov_exp |> str()
spc_tbl_ [3,195 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Reference Area Code : chr [1:3195] "AFG" "AFG" "AFG" "AFG" ...
$ Reference Area : chr [1:3195] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Time Period : num [1:3195] 2014 2013 2012 2011 2010 ...
$ Sex : chr [1:3195] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
$ Age group : chr [1:3195] "Not applicable" "Not applicable" "Not applicable" "Not applicable" ...
$ Units of measurement: chr [1:3195] "Percent" "Percent" "Percent" "Percent" ...
$ Observation Value : num [1:3195] 4.8 4.54 3.13 4.09 4.51 ...
- attr(*, "spec")=
.. cols(
.. `Reference Area Code` = col_character(),
.. `Reference Area` = col_character(),
.. `Time Period` = col_double(),
.. Sex = col_character(),
.. `Age group` = col_character(),
.. `Units of measurement` = col_character(),
.. `Observation Value` = col_double()
.. )
- attr(*, "problems")=<externalptr>
Observation
dbl,
and should be in integer.Select the categorical vaiables we want to check the values.
df_ed_gov_exp |> select(`Time Period`, Sex, `Age group`, `Units of measurement`) |> lapply(unique)
$`Time Period`
[1] 2014 2013 2012 2011 2010 1982 1981 1980 1979 1975 2007 2006 2005 2004 2003 2002 2001
[18] 2000 1999 1998 1997 1996 1995 1994 2008 2009 1987 1986 1985 1990 1989 1984 1983 1978
[35] 1977 1976 1993 1992 1991 1988 2015
$Sex
[1] "Not applicable"
$`Age group`
[1] "Not applicable"
$`Units of measurement`
[1] "Percent"
We do not need the categorical variable with only one value.
With ‘select’ we can change the names of the variables as well.
Change Time Period to integer values. Then delete
Time Period column. We do not need quotation marks for
column names, but if there is a space in the column name, we need to
surround by back ticks.
df_gov_exp <- df_ed_gov_exp |>
select(iso3c = `Reference Area Code`, country = `Reference Area`, `Time Period`, gov_exp = `Observation Value`) |> mutate(year = as.integer(`Time Period`), .after = country) |>
select(-`Time Period`) |> arrange(country, iso3c, year)
df_gov_exp
Now, we reached a clean data. We do the same to the second data.
df_tertiary_ratio <- read_csv("data/UNdata_Export_20240131_024820801.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)Rows: 2066 Columns: 8── Column specification ─────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Country or Area, Subgroup, Source, Unit, Value
dbl (3): Country or Area Code, Year, Value Footnotes
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_tertiary_ratio
df_tertiary_ratio |> str()
spc_tbl_ [2,066 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Country or Area Code: num [1:2066] 84 84 84 84 149 149 149 149 149 149 ...
$ Country or Area : chr [1:2066] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Subgroup : chr [1:2066] "Female" "Female" "Male" "Male" ...
$ Year : num [1:2066] 2004 2003 2004 2003 2004 ...
$ Source : chr [1:2066] "UNESCO_UIS Database_Sep2007" "UNESCO_UIS Database_Sep2007" "UNESCO_UIS Database_Sep2007" "UNESCO_UIS Database_Sep2007" ...
$ Unit : chr [1:2066] "Number" "Number" "Number" "Number" ...
$ Value : chr [1:2066] "5637" "5344" "22011" "20867" ...
$ Value Footnotes : num [1:2066] NA 1 NA NA NA NA NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. `Country or Area Code` = col_double(),
.. `Country or Area` = col_character(),
.. Subgroup = col_character(),
.. Year = col_double(),
.. Source = col_character(),
.. Unit = col_character(),
.. Value = col_character(),
.. `Value Footnotes` = col_double()
.. )
- attr(*, "problems")=<externalptr>
Besides the ones we noticed in the first data, surprisingly, the Value is in character. Let me check.
df_tertiary_ratio$Value |> unique()
[1] "5637" "5344" "22011" "20867" "32897" "27180" "26124"
[8] "25069" "24017" "23032" "20117" "16420" "16036" "15790"
[15] "16108" "15470" "429912" "365409" "325551" "351043" "174"
[22] "161" "149" "133" "168" "170" "157" "134"
[29] "5009" "3060" "7557" "4785" "25" "17" "12"
[36] "…" "8" "4" "0" "1234534" "1253534" "1202086"
[43] "1136287" "1066341" "987167" "892579" "847903" "824649" "782421"
[50] "700592" "613715" "48060" "44002" "39544" "40516" "37523"
[57] "33731" "32883" "38569" "35319" "34059" "34958" "31181"
[64] "29063" "27801" "1259" "1026" "992" "962" "986"
[71] "957" "780" "847" "678" "680" "630" "642"
[78] "621" "666" "553069" "543347" "544603" "546544" "470652"
[85] "458538" "457686" "461991" "459651" "461374" "465666" "398037"
[92] "386594" "387950" "131144" "127206" "121778" "117881" "137222"
[99] "133122" "126534" "113266" "111316" "108024" "105854" "127447"
[106] "128107" "126359" "60131" "56431" "54117" "53661" "50203"
[113] "46884" "42066" "68503" "66339" "67039" "67814" "70490"
[120] "70193" "65717" "12781" "11684" "11810" "6630" "6060"
[127] "6840" "7269" "4418" "304935" "259165" "280951" "273799"
[134] "297265" "234797" "228860" "606665" "562199" "596384" "581540"
[141] "581272" "491904" "480364" "5640" "5838" "4786" "2339"
[148] "2236" "2129" "299984" "289517" "278862" "263271" "246071"
[155] "231169" "216261" "228524" "217843" "209788" "200273" "191924"
[162] "180692" "171086" "211974" "207756" "199796" "194843" "189813"
[169] "185908" "185306" "177573" "178354" "174736" "172139" "169452"
[176] "169840" "166482" "507" "342" "215" "185" "3920"
[183] "3719" "3346" "15838" "15034" "12938" "1080" "1068"
[190] "880" "874" "639" "620" "539" "1254" "1217"
[197] "940" "5449" "4735" "3754" "3597" "2684" "2436"
[204] "5501" "5462" "4618" "4054" "3648" "3096" "2408720"
[211] "2254291" "2022907" "1753483" "1562153" "1365159" "1866307" "1740131"
[218] "1559198" "1372262" "1219175" "1091802" "825" "789" "735"
[225] "523" "423" "540" "598" "375" "347" "290"
[232] "235" "246" "210" "258" "3341" "3247" "2941"
[239] "2886" "2830" "2580" "2442" "1682" "1670" "1605"
[246] "1627" "1649" "1404" "1263" "123940" "119890" "121709"
[253] "123261" "139081" "149671" "160614" "113969" "108578" "108804"
[260] "105133" "107925" "111650" "109463" "8587" "4228" "4078"
[267] "3946" "3130" "2251" "19355" "14640" "14122" "11589"
[274] "9192" "7627" "4673" "4353" "3795" "3211" "1712"
[281] "1643" "1488" "12216" "11353" "8120" "7335" "4577"
[288] "4489" "3549" "17875" "14196" "12457" "9228" "6950"
[295] "5547" "38935" "31174" "30753" "22782" "18466" "16561"
[302] "39448" "32541" "31538" "30138" "60416" "51362" "49780"
[309] "47569" "748927" "707689" "678935" "683311" "577784" "547144"
[316] "533226" "537340" "1994" "1597" "1172" "924" "367"
[323] "409" "1916" "1439" "1043" "886" "351" "392"
[330] "291" "280" "99" "100" "1027" "1002" "5296"
[337] "5227" "1310" "1262" "755" "916" "885" "9158"
[344] "8819" "6642" "5190" "5016" "319529" "278712" "271278"
[351] "247809" "213254" "212448" "344165" "302103" "295836" "273800"
[358] "238923" "238504" "9944290" "8512217" "6657178" "11391356" "10904827"
[365] "8529039" "628006" "570760" "508128" "509251" "506344" "481970"
[372] "457595" "595588" "541814" "478552" "480494" "470899" "452115"
[379] "420349" "768" "737" "299" "277" "1011" "970"
[386] "415" "372" "1969" "1923" "1599" "3716" "10487"
[393] "10241" "11804" "11913" "8462" "60144" "59083" "41608"
[400] "40467" "41722" "32923" "31012" "50573" "49682" "37891"
[407] "36816" "37460" "28731" "27749" "25398" "71283" "64749"
[414] "59028" "54637" "51021" "50726" "56973" "53509" "49531"
[421] "45777" "45163" "293089" "247063" "132543" "104036" "92824"
[428] "84826" "81558" "178769" "149453" "103454" "87226" "85197"
[435] "73848" "71905" "10442" "9990" "9044" "7636" "6923"
[442] "5945" "6075" "9636" "10859" "6291" "5011" "4469"
[449] "4767" "176965" "163409" "145451" "145747" "130298" "126289"
[456] "114991" "159342" "155449" "141550" "138738" "129746" "127406"
[463] "116233" "133376" "125628" "116844" "112698" "108290" "107644"
[470] "106957" "98879" "91502" "84902" "83506" "83732" "81518"
[477] "83013" "708" "508" "371" "324" "207" "89"
[484] "988" "626" "535" "404" "289" "101" "86"
[491] "180045" "175973" "113520" "110981" "66982" "65451" "62774"
[498] "61449" "59551" "62222" "65299" "55449" "54813" "53747"
[505] "51917" "50395" "52453" "53192" "304" "699" "606"
[512] "738" "592" "4006" "4987" "4769" "3543" "3454"
[519] "41677" "40558" "39153" "37269" "34719" "31346" "28124"
[526] "26083" "25101" "24472" "23379" "23059" "22267" "20560"
[533] "46574" "43307" "37259" "26898" "18707" "14681" "9769"
[540] "144591" "128804" "110695" "74931" "68724" "53051" "42536"
[547] "6747" "6790" "6798" "5970" "5993" "5981" "163872"
[554] "160104" "156012" "153520" "150585" "145099" "141879" "142124"
[561] "139784" "135652" "130285" "129043" "125086" "121011" "1207887"
[568] "1189125" "1166248" "1111993" "1099629" "1092477" "1095322" "979496"
[575] "971175" "952901" "917186" "932114" "922867" "916871" "2667"
[582] "4806" "293" "265" "1237" "904" "87836" "78359"
[589] "75834" "74320" "68842" "67418" "67658" "86419" "76699"
[596] "79619" "74822" "71785" "69628" "62506" "41783" "22059"
[603] "22344" "18992" "18282" "13646" "77776" "47909" "47949"
[610] "49397" "45816" "41012" "330440" "308697" "286091" "271230"
[617] "244413" "211087" "194908" "316147" "288310" "275377" "258003"
[624] "233792" "211230" "192951" "49380" "48078" "65384" "63661"
[631] "4435" "2693" "2637" "19353" "14525" "14221" "74"
[638] "72" "78" "399" "391" "421" "4917" "4524"
[645] "2361" "2409" "71999" "70265" "64763" "54212" "50850"
[652] "47373" "50875" "49612" "45726" "42400" "39770" "37627"
[659] "77739" "75662" "73466" "74555" "72062" "72573" "254657"
[666] "241753" "221363" "195854" "181062" "165404" "151471" "181355"
[673] "180424" "169090" "158532" "149487" "141667" "127926" "10020"
[680] "9491" "8497" "7322" "6383" "5985" "5260" "5509"
[687] "5219" "4850" "4262" "3801" "3682" "3202" "4641576"
[694] "4527606" "4342409" "4135343" "3801494" "3552026" "7135720" "7325330"
[701] "6952632" "6441310" "6032552" "5852434" "1595396" "1556231" "1511088"
[708] "1458354" "1293089" "2044874" "1994861" "1930341" "1717479" "1724798"
[715] "1083664" "1003920" "869978" "768433" "743951" "635973" "568092"
[722] "1042610" "951000" "844455" "798076" "825825" "768907" "740058"
[729] "153711" "149213" "108238" "98306" "92880" "271197" "263332"
[736] "209755" "190364" "178628" "102501" "103941" "101155" "97173"
[743] "91158" "86904" "80900" "84060" "84374" "80402" "79123"
[750] "75442" "73707" "70237" "174092" "168126" "167738" "169338"
[757] "153654" "146626" "142182" "136845" "133101" "133588" "130378"
[764] "117325" "109265" "104624" "1139807" "1116589" "1076253" "1042977"
[771] "1014953" "982668" "991450" "875191" "869908" "837099" "811223"
[778] "797372" "787334" "805791" "32003" "31249" "28409" "23408"
[785] "13767" "14145" "14093" "12587" "1853130" "1848396" "1815521"
[792] "1789881" "1781996" "1787068" "1760494" "2185172" "2183208" "2168879"
[799] "2176786" "2190472" "2195001" "2180262" "109528" "109692" "95056"
[806] "79582" "73098" "108295" "104414" "91133" "83106" "69092"
[813] "437742" "381704" "341628" "287144" "241722" "200067" "172938"
[820] "315439" "282745" "261444" "232671" "203929" "170254" "151011"
[827] "38584" "34186" "32972" "31266" "64214" "63929" "61657"
[834] "57750" "1188250" "1178285" "1168159" "1127541" "1070021" "998297"
[841] "923158" "2036625" "2045146" "2041983" "2002358" "1933477" "1839583"
[848] "1713230" "27127" "23701" "23955" "23414" "22002" "21867"
[855] "11503" "13165" "13198" "13568" "12777" "10453" "121828"
[862] "110868" "108720" "110875" "96780" "80469" "66619" "98632"
[869] "94356" "92408" "98370" "93728" "80215" "64603" "19523"
[876] "12815" "10053" "8201" "6117" "4798" "3880" "27901"
[883] "20945" "18064" "14817" "10628" "9351" "8196" "82575"
[890] "79473" "73403" "67991" "63524" "57850" "50539" "48131"
[897] "48183" "45541" "42509" "39259" "33387" "31503" "87473"
[904] "80837" "77835" "75687" "69557" "60007" "56850" "78257"
[911] "73798" "66215" "67264" "64461" "56007" "56172" "4503"
[918] "3745" "2905" "3158" "2764" "2579" "3415" "2363"
[925] "2100" "1818" "1706" "1467" "18871" "3987" "25236"
[932] "16817" "192721" "184624" "162833" "141041" "149995" "182307"
[939] "174522" "161770" "149019" "158479" "142" "119" "390"
[946] "321" "117493" "109592" "100609" "89947" "81336" "73092"
[953] "64456" "77912" "73064" "66997" "58841" "54587" "48812"
[960] "42963" "1609" "1640" "1566" "1346" "1433" "1437"
[967] "1399" "1187" "1178" "1313" "10019" "10054" "9589"
[974] "7533" "6236" "3883" "3450" "13401" "14761" "16683"
[981] "12887" "7760" "3588" "4008" "27987" "26594" "25621"
[988] "24691" "22463" "20325" "19359" "21377" "20043" "20003"
[995] "20019" "17783" "16597" "15782" "21145" "19945"
[ reached getOption("max.print") -- omitted 757 entries ]
We can assume that these are numbers.
df_tertiary_ratio |> select(Subgroup, Year, Source, Unit, `Value Footnotes`) |> lapply(unique)
$Subgroup
[1] "Female" "Male" NA
$Year
[1] 2004 2003 2002 2001 2000 1999 2005 NA
$Source
[1] "UNESCO_UIS Database_Sep2007" NA
$Unit
[1] "Number" NA
$`Value Footnotes`
[1] NA 1 2 3 4 5
df_tertiary <- df_tertiary_ratio |> select(un_code = `Country or Area Code`, country = `Country or Area`, sex = Subgroup, Year, tertiary = Value) |> mutate(year = as.integer(Year), .before = sex, tertiary = as.numeric(tertiary)) |> select(-Year)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `tertiary = as.numeric(tertiary)`.
Caused by warning:
! NAs introduced by coercion
df_tertiary
Here, we check the list of countries in two data sets.
setdiff(unique(df_gov_exp$country), unique(df_tertiary$country))
[1] "Bolivia"
[2] "Brunei Darussalam"
[3] "Côte d'Ivoire"
[4] "Democratic Republic of the Congo"
[5] "Ecuador"
[6] "Egypt"
[7] "Gambia"
[8] "Germany"
[9] "Haiti"
[10] "Holy See"
[11] "Hong Kong Special Administrative Region of China"
[12] "Iran, Islamic Republic of"
[13] "Lao People's Democratic Republic"
[14] "Libyan Arab Jamahiriya"
[15] "Macao Special Administrative Region of China"
[16] "Micronesia, Federated States of"
[17] "Moldova"
[18] "Puerto Rico"
[19] "Republic of Korea"
[20] "Saint Vincent and the Grenadines"
[21] "Senegal"
[22] "Serbia"
[23] "Singapore"
[24] "South Sudan"
[25] "Sri Lanka"
[26] "Syrian Arab Republic"
[27] "The former Yugoslav Republic of Macedonia"
[28] "Timor-Leste"
[29] "Turkmenistan"
[30] "United Kingdom of Great Britain and Northern Ireland"
[31] "United Republic of Tanzania"
[32] "Venezuela (Bolivarian Republic of)"
[33] "Viet Nam"
setdiff(unique(df_tertiary$country), unique(df_gov_exp$country))
[1] "Brunei" "Cayman Islands"
[3] "Côte d’Ivoire" "Gambia The"
[5] "Gibraltar" "Hong Kong SAR"
[7] "Iran" "Korea Rep"
[9] "Lao PDR" "Libya"
[11] "Macau SAR" "Macedonia"
[13] "Nauru" "Netherlands Antilles"
[15] "Niue" "Palestinian Territory, Occupied"
[17] "Republic of Moldova" "Serbia and Montenegro"
[19] "St Vincent and the Grenadines" "Suriname"
[21] "Tanzania" "Timor Leste"
[23] "Tokelau" "United Kingdom"
[25] "Uzbekistan" "Venezuela"
[27] "Vietnam" "Footnote"
[29] "UIS estimation." "National Estimation."
[31] "Figure based on: National estimation." "Category not applicable."
[33] "Figure based on: UIS estimation."
We can observe that many country names do not agree.
Country Names are different.
Next check, if we can properly assign iso3c to the
second data. The following shows that excep the 8 countries below,
iso3c are assigned.
df_tertiary |>
mutate(iso3c = countrycode(country, "country.name", "iso3c"), .after = un_code) |>
filter(is.na(iso3c)) |> distinct(country)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(country, "country.name", "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: Category not applicable., Figure based on: National estimation., Figure based on: UIS estimation., Footnote, National Estimation., Netherlands Antilles, Serbia and Montenegro, UIS estimation.
Check whether two countries appear in the first data.
df_gov_exp |> filter(country %in% c("Netherlands Antilles", "Serbia and Montenegro"))
Let us assign iso3c and delete the rows which we cannot
assign iso3c.
df_tertiary1 <- df_tertiary |>
mutate(iso3c = countrycode(country, "country.name", "iso3c"), .after = un_code) |>
select(-country, -un_code) |> filter(!is.na(iso3c))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso3c = countrycode(country, "country.name", "iso3c")`.
Caused by warning:
! Some values were not matched unambiguously: Category not applicable., Figure based on: National estimation., Figure based on: UIS estimation., Footnote, National Estimation., Netherlands Antilles, Serbia and Montenegro, UIS estimation.
df_tertiary1
Now we combine these two datasets. We reached a reasonably clean combined dataset.
df_un_ed_combined <- df_gov_exp |> full_join(df_tertiary1, by = c("iso3c", "year"))
df_un_ed_combined
df_un_ed_combined |> filter(country %in% c("Japan", "Viet Nam", "China", "Republic of Korea", "United States of America")) |> drop_na(gov_exp) |>
ggplot(aes(year, gov_exp, col = country)) + geom_line() +
labs(title = "Five Countries' Goverment Expenditure on Education")
df_un_ed_combined |> filter(country %in% c("Japan", "Viet Nam", "China", "Republic of Korea", "United States of America")) |> drop_na(tertiary)
df_un_ed_combined |> filter(country %in% c("Japan", "Viet Nam", "China", "Republic of Korea", "United States of America")) |> drop_na(tertiary) |>
ggplot(aes(year, tertiary, col = country, linetype = sex)) + geom_line()
df_un_ed_combined |> filter(country %in% c("Japan", "Viet Nam", "China", "Republic of Korea", "United States of America")) |> drop_na(tertiary) |>
ggplot(aes(year, tertiary, col = country, linetype = sex)) + geom_line() +
scale_y_log10() + labs(title = "Gross Enrolment Ratio in Tertiary Education", subtitle = "Log 10 Scale in Number")
df_un_ed_combined_ext <- df_un_ed_combined |> left_join(wdi_country_extra, by = 'iso3c')
df_un_ed_combined_ext
wdicache <- read_rds("data/wdicache.rds")
wdi_country_extra <- wdicache$country |> select(iso2c, region, income)
INCOME <- c("High income", "Upper middle income", "Lower middle income", "Low income")
BRICS <- c("Brazil", "Russian Federation", "India", "China", "South Africa")
G7 <- c("Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States")
url_un_R_D <- "https://data.un.org/_Docs/SYB/CSV/SYB66_285_202310_Research%20and%20Development%20Staff%20and%20researchers%20in%20full%20time%20equivalent.csv"
df_un_R_D <- read_csv(url_un_R_D, skip = 1)
New names:Rows: 983 Columns: 7── Column specification ─────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_un_R_D
str(df_un_R_D)
spc_tbl_ [983 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Region/Country/Area: num [1:983] 1 1 1 1 1 1 1 1 15 15 ...
$ ...2 : chr [1:983] "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" "Total, all countries or areas" ...
$ Year : num [1:983] 2005 2005 2010 2010 2015 ...
$ Series : chr [1:983] "Gross domestic expenditure on R&D: as a percentage of GDP" "Researchers per million inhabitants (FTE)" "Gross domestic expenditure on R&D: as a percentage of GDP" "Researchers per million inhabitants (FTE)" ...
$ Value : num [1:983] 1.5 907.2 1.6 1022.1 1.7 ...
$ Footnotes : chr [1:983] NA NA NA NA ...
$ Source : chr [1:983] "United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for S"| __truncated__ "United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for S"| __truncated__ "United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for S"| __truncated__ "United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for S"| __truncated__ ...
- attr(*, "spec")=
.. cols(
.. `Region/Country/Area` = col_double(),
.. ...2 = col_character(),
.. Year = col_double(),
.. Series = col_character(),
.. Value = col_number(),
.. Footnotes = col_character(),
.. Source = col_character()
.. )
- attr(*, "problems")=<externalptr>
df_un_R_D |> select(Series, Footnotes, Source) |> lapply(unique)
$Series
[1] "Gross domestic expenditure on R&D: as a percentage of GDP"
[2] "Researchers per million inhabitants (FTE)"
$Footnotes
[1] NA
[2] "Partial data."
[3] "Break in the time series."
[4] "Partial data.;Excluding private non-profit.;Excluding business enterprise."
[5] "Excluding private non-profit.;Excluding business enterprise."
[6] "Estimate."
[7] "Estimate.;Estimate."
[8] "Data have been converted from the former national currency using the appropriate conversion rate."
[9] "Overestimated or based on overestimated data.;Excluding most or all capital expenditures."
[10] "Overestimated or based on overestimated data."
[11] "Break in the time series.;Higher Education only."
[12] "Break in the time series.;Excluding business enterprise."
[13] "Break in the time series.;Partial data."
[14] "Partial data.;Higher Education only."
[15] "For statistical purposes, the data for China do not include those for the Hong Kong Special Administrative Region (Hong Kong SAR), Macao Special Administrative Region (Macao SAR) and Taiwan Province of China."
[16] "Estimate.;Partial data."
[17] "Partial data.;Excluding business enterprise."
[18] "Break in the time series.;Excluding business enterprise.;Overestimated or based on overestimated data."
[19] "Government only.;S&T budget instead of R&D expenditure."
[20] "Partial data.;Government only."
[21] "Break in the time series.;Partial data.;Excluding higher education and private non-profit."
[22] "Estimate.;Break in the time series.;Partial data."
[23] "Excluding private non-profit.;Excluding government."
[24] "Excluding private non-profit.;Excluding business enterprise.;Based on R&D budget instead of R&D expenditure."
[25] "Do not correspond exactly to Frascati Manual recommendations."
[26] "Break in the time series.;Overestimated or based on overestimated data."
[27] "Break in the time series.;Partial data.;Excluding private non-profit.;Excluding business enterprise."
[28] "Government only."
[29] "Excluding private non-profit."
[30] "Break in the time series.;Partial data.;Government only."
[31] "Excluding business enterprise."
[32] "Overestimated or based on overestimated data.;Based on R&D budget instead of R&D expenditure."
[33] "Partial data.;Excluding government."
[34] "Partial data.;Based on R&D budget instead of R&D expenditure."
[35] "Higher Education only."
[36] "Overestimated. Based on R&D budget instead of R&D expenditure."
[37] "Do not correspond exactly to Frascati Manual recommendations.;Excluding data from some regions, provinces or states."
[38] "Excluding data from some regions, provinces or states."
[39] "Estimate.;Overestimated or based on overestimated data."
[40] "The sum of the breakdown does not add to the total."
[41] "Estimate.;Break in the time series.;Excluding data from some regions, provinces or states."
[42] "Break in the time series.;Do not correspond exactly to Frascati Manual recommendations."
$Source
[1] "United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed April 2023."
Add iso2c code, and check those the names listed in …2 whose iso2c_un is NA. From the output, these are all region names. So in the next code, we delete them.
library(countrycode)
df_un_R_D_region <- df_un_R_D |>
mutate(iso2c = countrycode(`Region/Country/Area`, "un", "iso2c"), .after = ...2) |>
filter(is.na(iso2c))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso2c = countrycode(`Region/Country/Area`, "un", "iso2c")`.
Caused by warning:
! Some values were not matched unambiguously: 1, 9, 15, 21, 30, 34, 35, 53, 62, 143, 145, 150, 199, 202, 419, 432, 722
df_un_R_D_region |> distinct(...2)
df_un_R_D |>
mutate(iso2c = countrycode(`Region/Country/Area`, "un", "iso2c"), .after = ...2) |>
filter(!is.na(iso2c)) |>
select(country = ...2, iso2c, year = Year, series = Series, value = Value) |>
pivot_wider(names_from = series, values_from = value)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso2c = countrycode(`Region/Country/Area`, "un", "iso2c")`.
Caused by warning:
! Some values were not matched unambiguously: 1, 9, 15, 21, 30, 34, 35, 53, 62, 143, 145, 150, 199, 202, 419, 432, 722
The following is another way of renaming series names, an alternative
to use mutate-case_when. The operation pivot_wider is the
converse operation to pivot_longer.
The units of two indicators are different. One is in percent of GDP, and the other is the number of researchers in one million inhabitants. So it is better to handle the data in wide format instead of that in long data.
df_R_D_wide <- df_un_R_D |>
mutate(iso2c = countrycode(`Region/Country/Area`, "un", "iso2c"), .after = ...2) |>
filter(!is.na(iso2c)) |>
select(country = ...2, iso2c, year = Year, series = Series, value = Value) |>
pivot_wider(names_from = series, values_from = value) |>
rename(rd_exp = `Gross domestic expenditure on R&D: as a percentage of GDP`, researcher = `Researchers per million inhabitants (FTE)`)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `iso2c = countrycode(`Region/Country/Area`, "un", "iso2c")`.
Caused by warning:
! Some values were not matched unambiguously: 1, 9, 15, 21, 30, 34, 35, 53, 62, 143, 145, 150, 199, 202, 419, 432, 722
df_R_D_wide
df_R_D_region_wide <- df_un_R_D_region |> select(country = ...2, year = Year, series = Series, value = Value) |>
pivot_wider(names_from = series, values_from = value) |>
rename(rd_exp = `Gross domestic expenditure on R&D: as a percentage of GDP`, researcher = `Researchers per million inhabitants (FTE)`)
df_R_D_region_wide
df_R_D_wide_extra <- df_R_D_wide |> left_join(wdi_country_extra, by = "iso2c")
df_R_D_wide_extra
df_R_D_extra <- df_R_D_wide_extra |> pivot_longer(cols = c(rd_exp, researcher))
df_R_D_extra
It is important to check the number of data in each year.
df_R_D_extra |> drop_na(value) |>
ggplot(aes(year, fill = name)) + geom_bar() + labs(title = "Count of data")
df_R_D_wide_extra |> drop_na(rd_exp, researcher) |>
ggplot(aes(rd_exp, researcher, col = income)) + geom_point() +
labs(title = "Scatterplot: Expenditure on R&D vs Researchers per million", subtitle = "by income level")
df_R_D_wide_extra |> drop_na(rd_exp, researcher) |>
ggplot(aes(rd_exp, researcher, col = region)) + geom_point() +
labs(title = "Scatterplot: Expenditure on R&D vs Researchers per million", subtitle = "by region")
df_R_D_wide_extra |> drop_na(rd_exp, researcher) |> lm(researcher ~ rd_exp, data = _) |>
summary()
Call:
lm(formula = researcher ~ rd_exp, data = drop_na(df_R_D_wide_extra,
rd_exp, researcher))
Residuals:
Min 1Q Median 3Q Max
-3251.5 -498.3 -112.3 360.5 3450.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.11 72.54 0.539 0.59
rd_exp 1998.86 53.60 37.289 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 951.9 on 354 degrees of freedom
Multiple R-squared: 0.7971, Adjusted R-squared: 0.7965
F-statistic: 1390 on 1 and 354 DF, p-value: < 2.2e-16
df_R_D_wide_extra |> drop_na(rd_exp, researcher) |> select(rd_exp, researcher) |> cor()
rd_exp researcher
rd_exp 1.0000000 0.8927913
researcher 0.8927913 1.0000000
The value is the square root of the multiple R-squared above. There variables are strongly positively correlated.
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(rd_exp) |>
filter(income %in% INCOME) |>
ggplot(aes(factor(income, levels = INCOME), rd_exp)) + geom_boxplot() +
labs(title = "Gross domestic expenditure on R&D: % of GDP by income level", x = "", y = "expenditure on R&D" )
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(researcher) |>
filter(income %in% INCOME) |>
ggplot(aes(factor(income, levels = INCOME), researcher)) + geom_boxplot() +
labs(title = "Researchers per million inhabitants (FTE) by income level", x = "")
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(rd_exp) |>
ggplot(aes(region, rd_exp)) + geom_boxplot() + coord_flip() +
labs(title = "Gross domestic expenditure on R&D: % of GDP ", x = "", y = "expenditure on R&D")
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(researcher) |>
ggplot(aes(region, researcher)) + geom_boxplot() + coord_flip() +
labs(title = "Researchers per million inhabitants (FTE) by region", x = "")
The variable year is in integers which is a numerical
variable. So in order to take it as a categorical variable, we use
factor(year).
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(rd_exp) |>
ggplot(aes(factor(year), rd_exp)) + geom_boxplot() +
labs(title = "Gross domestic expenditure on R&D: % of GDP ", x = "year", y = "expenditure on R&D")
df_R_D_wide_extra |> filter(year %in% c(2005, 2010, 2015, 2020)) |> drop_na(researcher) |>
ggplot(aes(factor(year), researcher)) + geom_boxplot() +
labs(title = "Researchers per million inhabitants (FTE) by year", x = "year", y = "expenditure on R&D")
df_R_D_wide_extra |> filter(country %in% G7) |> drop_na(rd_exp) |>
ggplot(aes(year, rd_exp, col = country)) + geom_line() +
labs(title = "G7 countries: Gross domestic expenditure on R&D: % of GDP ", y = "expenditure on R&D")
df_R_D_wide_extra |> filter(country %in% BRICS) |> drop_na(rd_exp) |>
ggplot(aes(year, rd_exp, col = country)) + geom_line() +
labs(title = "BRICS countries: Gross domestic expenditure on R&D: % of GDP ",y = "expenditure on R&D")
df_R_D_wide_extra |> filter(country %in% G7) |> drop_na(researcher) |>
ggplot(aes(year, researcher, col = country)) + geom_line() +
labs(title = "G7 countries: Researchers per million inhabitants (FTE)")
df_R_D_wide_extra |> filter(country %in% BRICS) |> drop_na(researcher) |>
ggplot(aes(year, researcher, col = country)) + geom_line() +
labs(title = "BRICS countries: Researchers per million inhabitants (FTE)")
df_R_D_region_wide |> distinct(country)
df_R_D_region_wide |> drop_na(rd_exp) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> ggplot(aes(year, rd_exp, col = country)) + geom_line() +
labs(title = "Gross domestic expenditure on R&D: % of GDP by region")
df_R_D_region_wide |> drop_na(researcher) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> ggplot(aes(year, researcher, col = country)) + geom_line() +
labs(title = "Researchers per million inhabitants (FTE) by region")
rd_exp_ranking <- df_R_D_region_wide |> filter(year == 2015) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> drop_na(rd_exp) |> arrange(rd_exp) |> pull(country)
rd_exp_ranking
[1] "Central Asia" "Sub-Saharan Africa"
[3] "Northern Africa" "Southern Asia"
[5] "Latin America & the Caribbean" "Western Asia"
[7] "South-eastern Asia" "Europe"
[9] "Oceania" "Eastern Asia"
[11] "Northern America"
df_R_D_region_wide |> filter(year == 2015) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> drop_na(rd_exp) |>
ggplot(aes(factor(country, levels = rd_exp_ranking), rd_exp)) + geom_col() + coord_flip() +
labs(title="Gross domestic expenditure on R&D: % of GDP by region in 2015", x = "", y = "")
researcher_ranking <- df_R_D_region_wide |> filter(year == 2015) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> drop_na(researcher) |> arrange(researcher) |> pull(country)
researcher_ranking
[1] "Sub-Saharan Africa" "Southern Asia"
[3] "South-central Asia" "Central Asia"
[5] "Latin America & the Caribbean" "South-eastern Asia"
[7] "Northern Africa" "Western Asia"
[9] "Eastern Asia" "Europe"
[11] "Oceania" "Northern America"
df_R_D_region_wide |> filter(year == 2015) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> drop_na(researcher) |>
ggplot(aes(factor(country, levels = researcher_ranking), researcher, fill= factor(country, levels = researcher_ranking))) + geom_col() + coord_flip() +
labs(title="Researchers per million inhabitants (FTE) by region in 2015", x = "", y = "") + theme(legend.position = "none")
df_R_D_region_wide |> drop_na(researcher) |>
filter(!(country %in% c("Total, all countries or areas", "Australia and New Zealand", "LLDCs", "LDC\xa7", "SIDS"))) |> ggplot(aes(year, researcher, col = country)) + geom_line() +
labs(title = "Researchers per million inhabitants (FTE) by region")
The R Notebook contains various technique on visualization. Link I
recommend rnaturalearth and rnaturalearthdata
packages introduced in class for choropleth maps.